Information about the data set

This dataset consists of 2,240 customers of supermarket XYZ, with data on customer profiles, consumption habits, campaign performance and channel preferences.

It was obtained from the Kaggle website. Refer to the following link for the source: https://www.kaggle.com/datasets/jackdaoud/marketing-data?resource=download

Data dictionary

  • AcceptedCmp1 - 1 if customer accepted the offer in the 1st campaign, 0 otherwise.
  • AcceptedCmp2 - 1 if customer accepted the offer in the 2nd campaign, 0 otherwise.
  • AcceptedCmp3 - 1 if customer accepted the offer in the 3rd campaign, 0 otherwise.
  • AcceptedCmp4 - 1 if customer accepted the offer in the 4th campaign, 0 otherwise.
  • AcceptedCmp5 - 1 if customer accepted the offer in the 5th campaign, 0 otherwise.
  • Response - 1 if customer accepted the offer in the last campaign, 0 otherwise.
  • Year_Birth - Birth Year of customer.
  • Dt_Customer - date of the customer's enrollment with the company.
  • Complain - 1 if customer complained in the last 2 years, 0 otherwise.
  • Education - customer’s level of education.
  • Marital_Status - customer’s marital status.
  • Kidhome - number of small children in customer’s household.
  • Teenhome - number of teenagers in customer’s household.
  • Income - customer’s yearly household income.
  • MntFishProducts - amount spent on fish products in the last 2 years.
  • MntMeatProducts - amount spent on meat products in the last 2 years.
  • MntFruits - amount spent on fruits in the last 2 years.
  • MntSweetProducts - amount spent on sweet products in the last 2 years.
  • MntWines - amount spent on wines in the last 2 years.
  • MntGoldProds - amount spent on gold products in the last 2 years.
  • NumDealsPurchases - number of purchases made with discount.
  • NumCatalogPurchases - number of purchases made using catalogue.
  • NumStorePurchases - number of purchases made directly in stores.
  • NumWebPurchases - number of purchases made through company’s web site.
  • NumWebVisitsMonth - number of visits to company’s web site in the last month.
  • Recency - number of days since the last purchase.

Task description

Supermarket XYZ wants to understand the impact of the marketing campaigns it has been running and optimize the selection of the best campaigns. To that end, it wants to predict whether a customer will accept the next campaign, in order to save the money spent on non-profitable marketing actions.

The supermarket also wants to understand its customers better in order to address their needs and offer relevant products to each of them. This requires precise profiling of each customer segment. Additionally, management wants to derive insights about the consumption habits of each segment.

Exploratory analysis of the dataset

An example of what the data looks like:

head(dataset) 
#>   ï..ID Year_Birth  Education Marital_Status Income Kidhome Teenhome
#> 1  5524       1957 Graduation         Single  58138       0        0
#> 2  2174       1954 Graduation         Single  46344       1        1
#> 3  4141       1965 Graduation       Together  71613       0        0
#> 4  6182       1984 Graduation       Together  26646       1        0
#> 5  5324       1981        PhD        Married  58293       1        0
#> 6  7446       1967     Master       Together  62513       0        1
#>   Dt_Customer Recency MntWines MntFruits MntMeatProducts MntFishProducts
#> 1  2012-09-04      58      635        88             546             172
#> 2  2014-03-08      38       11         1               6               2
#> 3  2013-08-21      26      426        49             127             111
#> 4  2014-02-10      26       11         4              20              10
#> 5  2014-01-19      94      173        43             118              46
#> 6  2013-09-09      16      520        42              98               0
#>   MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
#> 1               88           88                 3               8
#> 2                1            6                 2               1
#> 3               21           42                 1               8
#> 4                3            5                 2               2
#> 5               27           15                 5               5
#> 6               42           14                 2               6
#>   NumCatalogPurchases NumStorePurchases NumWebVisitsMonth AcceptedCmp3
#> 1                  10                 4                 7            0
#> 2                   1                 2                 5            0
#> 3                   2                10                 4            0
#> 4                   0                 4                 6            0
#> 5                   3                 6                 5            0
#> 6                   4                10                 6            0
#>   AcceptedCmp4 AcceptedCmp5 AcceptedCmp1 AcceptedCmp2 Complain Z_CostContact
#> 1            0            0            0            0        0             3
#> 2            0            0            0            0        0             3
#> 3            0            0            0            0        0             3
#> 4            0            0            0            0        0             3
#> 5            0            0            0            0        0             3
#> 6            0            0            0            0        0             3
#>   Z_Revenue Response
#> 1        11        1
#> 2        11        0
#> 3        11        0
#> 4        11        0
#> 5        11        0
#> 6        11        0

Data types

Most variables are int (integer); three are chr (character): two correspond to categories (Education, Marital_Status) and one to a date (Dt_Customer).

str(dataset) 
#> 'data.frame':    2240 obs. of  29 variables:
#>  $ ï..ID              : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
#>  $ Year_Birth         : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
#>  $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
#>  $ Marital_Status     : chr  "Single" "Single" "Together" "Together" ...
#>  $ Income             : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
#>  $ Kidhome            : int  0 1 0 1 1 0 0 1 1 1 ...
#>  $ Teenhome           : int  0 1 0 0 0 1 1 0 0 1 ...
#>  $ Dt_Customer        : chr  "2012-09-04" "2014-03-08" "2013-08-21" "2014-02-10" ...
#>  $ Recency            : int  58 38 26 26 94 16 34 32 19 68 ...
#>  $ MntWines           : int  635 11 426 11 173 520 235 76 14 28 ...
#>  $ MntFruits          : int  88 1 49 4 43 42 65 10 0 0 ...
#>  $ MntMeatProducts    : int  546 6 127 20 118 98 164 56 24 6 ...
#>  $ MntFishProducts    : int  172 2 111 10 46 0 50 3 3 1 ...
#>  $ MntSweetProducts   : int  88 1 21 3 27 42 49 1 3 1 ...
#>  $ MntGoldProds       : int  88 6 42 5 15 14 27 23 2 13 ...
#>  $ NumDealsPurchases  : int  3 2 1 2 5 2 4 2 1 1 ...
#>  $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
#>  $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
#>  $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
#>  $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
#>  $ AcceptedCmp3       : int  0 0 0 0 0 0 0 0 0 1 ...
#>  $ AcceptedCmp4       : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ AcceptedCmp5       : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ AcceptedCmp1       : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ AcceptedCmp2       : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Z_CostContact      : int  3 3 3 3 3 3 3 3 3 3 ...
#>  $ Z_Revenue          : int  11 11 11 11 11 11 11 11 11 11 ...
#>  $ Response           : int  1 0 0 0 0 0 0 0 1 0 ...

Data statistics

Check for NA values:

  • Only the variable Income has NA values (24 of them); these rows can be removed without impacting the dataset, as they represent about 1% of the records.
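One way to verify and drop these rows (a sketch; it keeps the result in a separate data frame, `ds_noNA`, so the original `dataset` is untouched):

```r
# Count missing values per column; only Income is expected to have any
colSums(is.na(dataset))

# Remove the ~1% of rows with missing Income
ds_noNA <- dataset[!is.na(dataset$Income), ]
nrow(dataset) - nrow(ds_noNA)  # number of rows dropped
```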

Check for outliers:

  • Year_Birth: the minimum is 1893, which would make the customer about 130 years old today; it may be a data-entry error.
  • Income: has a maximum of 666,666 while the median is 51,382. It is an outlier that can be removed.
  • MntWines and MntMeatProducts have quite large maximum values that may be pulling the means up. These outliers can be removed.
  • Z_CostContact and Z_Revenue each take a single value, so they provide no information and will be dropped. The variables ï..ID and Dt_Customer will also be dropped.
summary(dataset) 
#>      ï..ID         Year_Birth    Education         Marital_Status    
#>  Min.   :    0   Min.   :1893   Length:2240        Length:2240       
#>  1st Qu.: 2828   1st Qu.:1959   Class :character   Class :character  
#>  Median : 5458   Median :1970   Mode  :character   Mode  :character  
#>  Mean   : 5592   Mean   :1969                                        
#>  3rd Qu.: 8428   3rd Qu.:1977                                        
#>  Max.   :11191   Max.   :1996                                        
#>                                                                      
#>      Income          Kidhome          Teenhome      Dt_Customer       
#>  Min.   :  1730   Min.   :0.0000   Min.   :0.0000   Length:2240       
#>  1st Qu.: 35303   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
#>  Median : 51382   Median :0.0000   Median :0.0000   Mode  :character  
#>  Mean   : 52247   Mean   :0.4442   Mean   :0.5062                     
#>  3rd Qu.: 68522   3rd Qu.:1.0000   3rd Qu.:1.0000                     
#>  Max.   :666666   Max.   :2.0000   Max.   :2.0000                     
#>  NA's   :24                                                           
#>     Recency         MntWines         MntFruits     MntMeatProducts 
#>  Min.   : 0.00   Min.   :   0.00   Min.   :  0.0   Min.   :   0.0  
#>  1st Qu.:24.00   1st Qu.:  23.75   1st Qu.:  1.0   1st Qu.:  16.0  
#>  Median :49.00   Median : 173.50   Median :  8.0   Median :  67.0  
#>  Mean   :49.11   Mean   : 303.94   Mean   : 26.3   Mean   : 166.9  
#>  3rd Qu.:74.00   3rd Qu.: 504.25   3rd Qu.: 33.0   3rd Qu.: 232.0  
#>  Max.   :99.00   Max.   :1493.00   Max.   :199.0   Max.   :1725.0  
#>                                                                    
#>  MntFishProducts  MntSweetProducts  MntGoldProds    NumDealsPurchases
#>  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   
#>  1st Qu.:  3.00   1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000   
#>  Median : 12.00   Median :  8.00   Median : 24.00   Median : 2.000   
#>  Mean   : 37.53   Mean   : 27.06   Mean   : 44.02   Mean   : 2.325   
#>  3rd Qu.: 50.00   3rd Qu.: 33.00   3rd Qu.: 56.00   3rd Qu.: 3.000   
#>  Max.   :259.00   Max.   :263.00   Max.   :362.00   Max.   :15.000   
#>                                                                      
#>  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
#>  Min.   : 0.000   Min.   : 0.000      Min.   : 0.00     Min.   : 0.000   
#>  1st Qu.: 2.000   1st Qu.: 0.000      1st Qu.: 3.00     1st Qu.: 3.000   
#>  Median : 4.000   Median : 2.000      Median : 5.00     Median : 6.000   
#>  Mean   : 4.085   Mean   : 2.662      Mean   : 5.79     Mean   : 5.317   
#>  3rd Qu.: 6.000   3rd Qu.: 4.000      3rd Qu.: 8.00     3rd Qu.: 7.000   
#>  Max.   :27.000   Max.   :28.000      Max.   :13.00     Max.   :20.000   
#>                                                                          
#>   AcceptedCmp3      AcceptedCmp4      AcceptedCmp5      AcceptedCmp1    
#>  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
#>  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
#>  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
#>  Mean   :0.07277   Mean   :0.07455   Mean   :0.07277   Mean   :0.06429  
#>  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
#>  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
#>                                                                         
#>   AcceptedCmp2        Complain        Z_CostContact   Z_Revenue 
#>  Min.   :0.00000   Min.   :0.000000   Min.   :3     Min.   :11  
#>  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:3     1st Qu.:11  
#>  Median :0.00000   Median :0.000000   Median :3     Median :11  
#>  Mean   :0.01339   Mean   :0.009375   Mean   :3     Mean   :11  
#>  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:3     3rd Qu.:11  
#>  Max.   :1.00000   Max.   :1.000000   Max.   :3     Max.   :11  
#>                                                                 
#>     Response     
#>  Min.   :0.0000  
#>  1st Qu.:0.0000  
#>  Median :0.0000  
#>  Mean   :0.1491  
#>  3rd Qu.:0.0000  
#>  Max.   :1.0000  
#> 
ds2 <- select(dataset, -c(ï..ID, Dt_Customer, Z_CostContact, Z_Revenue))

Visualize data

Visualize bar plots

  • Some categories of Marital_Status can be grouped together.
  • Some categories of Education can be grouped together.
  • MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts and MntGoldProds have long-tailed distributions.
  • AcceptedCmp1, AcceptedCmp2, AcceptedCmp3, AcceptedCmp4, AcceptedCmp5, Complain and Response are predominantly 0.
column_names <- names(ds2)

bar_plots_list <- list()

# Iterate over each column and create a barplot using ggplot
for (col in column_names) {
  p <- ggplot(ds2, aes(x = .data[[col]])) +
    geom_bar() +
    labs(title = col)
  
   bar_plots_list[[col]] <- p
}

num_cols_per_page <- 2
num_plots <- length(column_names)

for (i in seq(1, num_plots, by = num_cols_per_page)) {
  end <- min(i + num_cols_per_page - 1, num_plots)
  plots_subset <- bar_plots_list[i:end]
  grid.arrange(grobs = plots_subset, ncol = num_cols_per_page)
}

Visualize box plots

  • Year_Birth has bottom outliers.
  • Income has upper outliers.
  • For MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts and MntGoldProds, review whether the apparent outliers are simply due to the skewed distributions.
column_names_box_plot <- c("Year_Birth","Income","Recency","MntWines","MntFruits","MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds", "NumDealsPurchases","NumWebPurchases","NumCatalogPurchases","NumStorePurchases","NumWebVisitsMonth")
box_plots_list <- list()

# Iterate over each column and create a box plot using ggplot
for (col in column_names_box_plot) {
  p <- ggplot(ds2, aes(y = .data[[col]])) +
    geom_boxplot() +
    labs(title = col)
  
   box_plots_list[[col]] <- p
}

num_plots_box <- length(column_names_box_plot)

for (i in seq(1, num_plots_box, by = num_cols_per_page)) {
  end <- min(i + num_cols_per_page - 1, num_plots_box)
  plots_subset <- box_plots_list[i:end]
  grid.arrange(grobs = plots_subset, ncol = num_cols_per_page)
}

Preprocessing

The first step is removing the outliers using Interquartile Range for the following variables:

  • “Year_Birth”: there were customers who would be more than 125 years old.
  • “Income”: there are some very high income values.

The second step is encoding some of the categorical variables to include in the model:

  • The “Marital_Status” attribute can be narrowed to two categories: 0 (Single) or 1 (Together).
  • The “Education” attribute takes the values 2n Cycle, Basic, Graduation, Master and PhD. We recode them into 3 groups: Basic (0), Graduation (1), and 2n Cycle/Master/PhD (2). With this recoding we can better track and order the education levels (0, 1, 2).

#Remove outliers
# Year_birth
tquantile <- quantile(ds2$Year_Birth, probs=c(.25, .75), na.rm = FALSE) 
tiqr<- IQR(ds2$Year_Birth)                                    
tlower <- tquantile[1] - 1.5*tiqr                                     
tupper <- tquantile[2] + 1.5*tiqr                     
ds2<- subset(ds2, ds2$Year_Birth > tlower & ds2$Year_Birth <tupper)

# Income
# Express income in thousands: truncate the last three digits, then strip punctuation
ds2$Income <- gsub('.{3}$', '', as.character(ds2$Income))
ds2$Income <- gsub('[[:punct:]]', '', as.character(ds2$Income))
ds2$Income <- as.numeric(ds2$Income)

tquantile <- quantile(ds2$Income, probs=c(.25, .75), na.rm = TRUE) 
tiqr<- IQR(ds2$Income, na.rm=TRUE)                                    
tlower <- tquantile[1] - 1.5*tiqr                                     
tupper <- tquantile[2] + 1.5*tiqr                     
ds2<- subset(ds2, ds2$Income > tlower & ds2$Income <tupper)

# Recoding
#Marital status
ds2$Marital_Status <- recode(ds2$Marital_Status, Divorced = 0, Alone = 0, YOLO = 0, Absurd = 0, Single = 0, Widow = 0, Married = 1, Together = 1)
#Education
ds2$Education<- recode(ds2$Education, "2n Cycle" = 2, Basic = 0, Graduation = 1, Master= 2, PhD = 2)

ds3 <- ds2

summary(ds3)
#>    Year_Birth     Education     Marital_Status       Income      
#>  Min.   :1940   Min.   :0.000   Min.   :0.0000   Min.   :  1.00  
#>  1st Qu.:1959   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 35.00  
#>  Median :1970   Median :1.000   Median :1.0000   Median : 51.00  
#>  Mean   :1969   Mean   :1.446   Mean   :0.6449   Mean   : 51.12  
#>  3rd Qu.:1977   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 68.00  
#>  Max.   :1996   Max.   :2.000   Max.   :1.0000   Max.   :113.00  
#>     Kidhome          Teenhome         Recency         MntWines     
#>  Min.   :0.0000   Min.   :0.0000   Min.   : 0.00   Min.   :   0.0  
#>  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:24.00   1st Qu.:  24.0  
#>  Median :0.0000   Median :0.0000   Median :49.00   Median : 178.0  
#>  Mean   :0.4422   Mean   :0.5066   Mean   :49.01   Mean   : 306.2  
#>  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:74.00   3rd Qu.: 507.0  
#>  Max.   :2.0000   Max.   :2.0000   Max.   :99.00   Max.   :1493.0  
#>    MntFruits     MntMeatProducts  MntFishProducts  MntSweetProducts
#>  Min.   :  0.0   Min.   :   0.0   Min.   :  0.00   Min.   :  0.00  
#>  1st Qu.:  2.0   1st Qu.:  16.0   1st Qu.:  3.00   1st Qu.:  1.00  
#>  Median :  8.0   Median :  68.0   Median : 12.00   Median :  8.00  
#>  Mean   : 26.4   Mean   : 165.3   Mean   : 37.76   Mean   : 27.13  
#>  3rd Qu.: 33.0   3rd Qu.: 232.0   3rd Qu.: 50.00   3rd Qu.: 34.00  
#>  Max.   :199.0   Max.   :1725.0   Max.   :259.00   Max.   :262.00  
#>   MntGoldProds    NumDealsPurchases NumWebPurchases  NumCatalogPurchases
#>  Min.   :  0.00   Min.   : 0.000    Min.   : 0.000   Min.   : 0.000     
#>  1st Qu.:  9.00   1st Qu.: 1.000    1st Qu.: 2.000   1st Qu.: 0.000     
#>  Median : 25.00   Median : 2.000    Median : 4.000   Median : 2.000     
#>  Mean   : 44.06   Mean   : 2.318    Mean   : 4.101   Mean   : 2.645     
#>  3rd Qu.: 56.00   3rd Qu.: 3.000    3rd Qu.: 6.000   3rd Qu.: 4.000     
#>  Max.   :321.00   Max.   :15.000    Max.   :27.000   Max.   :28.000     
#>  NumStorePurchases NumWebVisitsMonth  AcceptedCmp3      AcceptedCmp4    
#>  Min.   : 0.000    Min.   : 0.000    Min.   :0.00000   Min.   :0.00000  
#>  1st Qu.: 3.000    1st Qu.: 3.000    1st Qu.:0.00000   1st Qu.:0.00000  
#>  Median : 5.000    Median : 6.000    Median :0.00000   Median :0.00000  
#>  Mean   : 5.824    Mean   : 5.337    Mean   :0.07392   Mean   :0.07438  
#>  3rd Qu.: 8.000    3rd Qu.: 7.000    3rd Qu.:0.00000   3rd Qu.:0.00000  
#>  Max.   :13.000    Max.   :20.000    Max.   :1.00000   Max.   :1.00000  
#>   AcceptedCmp5      AcceptedCmp1     AcceptedCmp2        Complain      
#>  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
#>  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
#>  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.00000  
#>  Mean   :0.07302   Mean   :0.0644   Mean   :0.01361   Mean   :0.00907  
#>  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000  
#>  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
#>     Response    
#>  Min.   :0.000  
#>  1st Qu.:0.000  
#>  Median :0.000  
#>  Mean   :0.151  
#>  3rd Qu.:0.000  
#>  Max.   :1.000

Correlation

It is interesting to look at the correlation matrix: we can identify high positive correlations between Income and MntWines, MntMeatProducts, NumCatalogPurchases and NumStorePurchases, and high negative correlations between Income and both Kidhome and NumWebVisitsMonth.

res <- cor(ds3)

corrplot(res, method="number", title= "Correlation matrix",
         tl.cex = 0.7, tl.srt = 45, tl.col = "black",
         cl.cex = 0.8, cl.ratio = 0.2, cl.align = "r",
         addCoef.col = "black", number.digits = 2,number.cex = 0.5, 
         mar = c(0, 0, 2, 0))

Scatter plots

Take a look at some scatter plots related to income and other variables.

  • The higher the education, the higher the income.
  • The lower the income, the more kids in the home.
  • There seems to be a positive relationship between income and the amounts spent on wines, fruits, meat, fish, sweets and gold: the higher the income, the larger the amounts purchased.
  • There seems to be a positive relationship between income and the number of web, catalog and store purchases: the higher the income, the more purchases. In contrast, there is a negative relationship between income and the number of web visits per month: the higher the income, the fewer visits to the web page.
##Scatter plots

column_names_scatter_plot <- c("Year_Birth", "Education","Kidhome","Recency","MntWines","MntFruits","MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds", "NumDealsPurchases","NumWebPurchases","NumCatalogPurchases","NumStorePurchases","NumWebVisitsMonth")
scatter_plots_list <- list()

# Iterate over each column and create a scatter using ggplot
for (col in column_names_scatter_plot) {
  p <- ggplot(ds3, aes(x = .data[[col]], y = Income)) +
    geom_point() +
    labs(title = col)
  
   scatter_plots_list[[col]] <- p
}

num_plots_scatter <- length(column_names_scatter_plot)

for (i in seq(1, num_plots_scatter, by = num_cols_per_page)) {
  end <- min(i + num_cols_per_page - 1, num_plots_scatter)
  plots_subset <- scatter_plots_list[i:end]
  grid.arrange(grobs = plots_subset, ncol = num_cols_per_page)
}

Classification

Split of dataset

For training the model we prepare the data in the following steps:

  1. Set a seed (123), then divide the dataset into a training set and a test set: 70% of the data for training and 30% for testing. The split is done randomly, without replacement, using indexes.
###Divide train & test

set.seed(123)  # For reproducibility

# Generate indices for train and test sets
train_indices <- sample(seq_len(nrow(ds3)), size = floor(0.7 * nrow(ds3)), replace = FALSE)
test_indices <- setdiff(seq_len(nrow(ds3)), train_indices)

# Create train and test datasets
train <- ds3[train_indices, ]
test <- ds3[test_indices, ]
  2. Check whether the response variable is unbalanced. It is: 85% of the values are 0 while 15% are 1.
#proportion of response variable
prop.table(table(train$Response))*100 # display the ratio
#> 
#>        0        1 
#> 85.15878 14.84122

An oversampling is applied to balance classes to approximately 50% for each class.

# ovun.sample() from the ROSE package oversamples the minority class
train_balanced <- ovun.sample(Response ~ ., data = train, method = "over")$data
prop.table(table(train_balanced$Response))*100 # display the ratio
#> 
#>        0        1 
#> 50.13354 49.86646

Modelling

Naive Bayes classifier

Running the Naive Bayes classifier yields 77% accuracy, 58% recall and 36% precision.

nb_model <- naiveBayes(Response ~ ., data = train_balanced)
predictions <- predict(nb_model, newdata = test, type = "class")

cm_nb <- confusionMatrix(predictions, factor(test$Response), positive = "1")

print(cm_nb)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 449  44
#>          1 109  60
#>                                           
#>                Accuracy : 0.7689          
#>                  95% CI : (0.7349, 0.8005)
#>     No Information Rate : 0.8429          
#>     P-Value [Acc > NIR] : 1               
#>                                           
#>                   Kappa : 0.3042          
#>                                           
#>  Mcnemar's Test P-Value : 2.29e-07        
#>                                           
#>             Sensitivity : 0.57692         
#>             Specificity : 0.80466         
#>          Pos Pred Value : 0.35503         
#>          Neg Pred Value : 0.91075         
#>              Prevalence : 0.15710         
#>          Detection Rate : 0.09063         
#>    Detection Prevalence : 0.25529         
#>       Balanced Accuracy : 0.69079         
#>                                           
#>        'Positive' Class : 1               
#> 

Logistic regression model

Looking at the confusion matrix, we observe 80% accuracy, 77% recall and 43% precision. This is a bit better than the Naive Bayes model.

Some of the most important variables are: Recency, NumStorePurchases, Marital_Status and Education.

#fit logistic regression model

#disable scientific notation for model summary
options(scipen=999)

model <- glm(Response ~.,family="binomial", data=train_balanced)

#view model summary
summary(model)
#> 
#> Call:
#> glm(formula = Response ~ ., family = "binomial", data = train_balanced)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -3.8045  -0.6686  -0.1009   0.6813   2.6643  
#> 
#> Coefficients:
#>                       Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)         21.0214551 10.3529623   2.030             0.042308 *  
#> Year_Birth          -0.0120069  0.0052405  -2.291             0.021952 *  
#> Education            0.9531867  0.1072961   8.884 < 0.0000000000000002 ***
#> Marital_Status      -1.0526061  0.1097368  -9.592 < 0.0000000000000002 ***
#> Income              -0.0043446  0.0064948  -0.669             0.503532    
#> Kidhome             -0.0861538  0.1481035  -0.582             0.560760    
#> Teenhome            -1.0066680  0.1454380  -6.922 0.000000000004464730 ***
#> Recency             -0.0227471  0.0018866 -12.057 < 0.0000000000000002 ***
#> MntWines            -0.0001223  0.0002804  -0.436             0.662704    
#> MntFruits            0.0054039  0.0019625   2.754             0.005894 ** 
#> MntMeatProducts      0.0032034  0.0004016   7.976 0.000000000000001512 ***
#> MntFishProducts      0.0005507  0.0014396   0.383             0.702083    
#> MntSweetProducts     0.0043246  0.0017831   2.425             0.015294 *  
#> MntGoldProds        -0.0002267  0.0012520  -0.181             0.856320    
#> NumDealsPurchases    0.2385229  0.0393975   6.054 0.000000001410592464 ***
#> NumWebPurchases      0.0990963  0.0307425   3.223             0.001267 ** 
#> NumCatalogPurchases  0.1933626  0.0322909   5.988 0.000000002122549355 ***
#> NumStorePurchases   -0.2635514  0.0268754  -9.806 < 0.0000000000000002 ***
#> NumWebVisitsMonth    0.3413908  0.0426229   8.010 0.000000000000001151 ***
#> AcceptedCmp3         1.2104734  0.1849638   6.544 0.000000000059742433 ***
#> AcceptedCmp4         0.8845757  0.2300388   3.845             0.000120 ***
#> AcceptedCmp5         1.9456921  0.2393113   8.130 0.000000000000000428 ***
#> AcceptedCmp1         0.9050976  0.2383136   3.798             0.000146 ***
#> AcceptedCmp2         1.5188642  0.4809556   3.158             0.001588 ** 
#> Complain            -2.0488798  1.0176357  -2.013             0.044075 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 3633.5  on 2620  degrees of freedom
#> Residual deviance: 2260.1  on 2596  degrees of freedom
#> AIC: 2310.1
#> 
#> Number of Fisher Scoring iterations: 5
probabilities <- model %>% predict(test, type = "response")
predicted_classes <- ifelse(probabilities > 0.5, "1", "0")

cm_lr1 <-confusionMatrix(as.factor(predicted_classes),as.factor(test$Response), positive = "1")

print(cm_lr1)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 453  23
#>          1 105  81
#>                                           
#>                Accuracy : 0.8066          
#>                  95% CI : (0.7745, 0.8361)
#>     No Information Rate : 0.8429          
#>     P-Value [Acc > NIR] : 0.9947          
#>                                           
#>                   Kappa : 0.4472          
#>                                           
#>  Mcnemar's Test P-Value : 0.00000000000081
#>                                           
#>             Sensitivity : 0.7788          
#>             Specificity : 0.8118          
#>          Pos Pred Value : 0.4355          
#>          Neg Pred Value : 0.9517          
#>              Prevalence : 0.1571          
#>          Detection Rate : 0.1224          
#>    Detection Prevalence : 0.2810          
#>       Balanced Accuracy : 0.7953          
#>                                           
#>        'Positive' Class : 1               
#> 
# 80% accuracy
# 77% Recall
# 43% Precision
# 55% F1
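The recall, precision and F1 figures quoted above can be read directly off the caret confusion-matrix object rather than computed by hand (a sketch, using the `cm_lr1` object created above):

```r
# caret stores the per-class metrics in the byClass slot
cm_lr1$byClass[c("Recall", "Precision", "F1")]
```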

#Variable importance

vlr <- varImp(model)
VI_lr<- data.frame(var=names(train_balanced[,-25]), imp=vlr)

VI_plot_lr <- VI_lr[order(VI_lr$Overall,decreasing=FALSE),]
barplot(VI_plot_lr$Overall,
        names.arg=rownames(VI_plot_lr),
        horiz=TRUE,
        col='steelblue',
        xlab='Variable Importance',
        main="Variable importance logistic regression",
        las = 2,
        cex.names = 0.65)

Principal Component Analysis (PCA)

The correlation matrix showed that many variables are highly correlated, so we apply Principal Component Analysis (PCA). This technique reduces the dimensionality of the data while preserving the most important patterns.

###PCA

d_pca<- prcomp(train_balanced, center = TRUE,scale. = TRUE)

pca_var<- d_pca$sdev^2

pve <- pca_var/sum(pca_var)

The variability explained by the components can be seen in the cumulative-sum plot: the first 12 principal components explain 80% of the variability. These components will be used to fit the logistic regression.

plot(cumsum(pve), xlab="Principal Component",
     ylab="Proportion of variation explained",
     ylim=c(0,1),
     type="b",
     main= "CUMSUM Scree Plot",
    )
abline(h=0.8, col="red")

pcadata <- data.frame(Response = train_balanced[,"Response"],d_pca$x[,1:12])

Logistic regression after PCA

After applying PCA to the training set, the logistic regression was run again, obtaining an accuracy of 96.8%, with 97% recall and 84% precision.

This model performs well, but it is not an explanatory model: the components created by PCA cannot be directly interpreted. An explainable model should therefore also be considered, since it is relevant to know which variables matter most. Note also that the PCA was fitted on data that includes the Response column itself, so part of the apparent accuracy may stem from the response leaking into the components.
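One partial remedy for the interpretability problem is to inspect the PCA loadings, which show how each original variable contributes to a component (a sketch, using the `d_pca` object fitted above):

```r
# Loadings of the original variables on the first three components,
# sorted by absolute contribution to PC1
loadings <- d_pca$rotation[, 1:3]
loadings[order(abs(loadings[, "PC1"]), decreasing = TRUE), ]
```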

model3<- glm(Response ~. ,family="binomial", data=pcadata)

#view model summary

summary(model3)
#> 
#> Call:
#> glm(formula = Response ~ ., family = "binomial", data = pcadata)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -5.1897  -0.0186  -0.0001   0.0363   2.6581  
#> 
#> Coefficients:
#>             Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept) -0.47520    0.17294  -2.748             0.005999 ** 
#> PC1          1.29896    0.09996  12.995 < 0.0000000000000002 ***
#> PC2          0.14285    0.09420   1.516             0.129393    
#> PC3         -3.86215    0.28366 -13.615 < 0.0000000000000002 ***
#> PC4          4.55588    0.32293  14.108 < 0.0000000000000002 ***
#> PC5          1.22139    0.16646   7.337    0.000000000000218 ***
#> PC6         -1.50029    0.16522  -9.081 < 0.0000000000000002 ***
#> PC7         -0.66438    0.17870  -3.718             0.000201 ***
#> PC8         -0.36397    0.16531  -2.202             0.027684 *  
#> PC9          0.23314    0.18109   1.287             0.197940    
#> PC10        -3.03691    0.28000 -10.846 < 0.0000000000000002 ***
#> PC11        -0.34858    0.19699  -1.770             0.076801 .  
#> PC12        -1.58912    0.22844  -6.956    0.000000000003489 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 3633.46  on 2620  degrees of freedom
#> Residual deviance:  317.89  on 2608  degrees of freedom
#> AIC: 343.89
#> 
#> Number of Fisher Scoring iterations: 9
# preprocess the test data as done with train: project it onto the principal components
test.p <- predict(d_pca, newdata = test[,1:25])

predict_pca <- predict(model3, newdata=data.frame(test.p[,1:12]), type="response")

predicted_classes3 <- factor(ifelse(predict_pca >= 0.5, "1", "0"))

#confusion matrix

cm_pca <- confusionMatrix(as.factor(predicted_classes3),as.factor(test$Response), positive = "1")

print(cm_pca)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 540   3
#>          1  18 101
#>                                               
#>                Accuracy : 0.9683              
#>                  95% CI : (0.9519, 0.9803)    
#>     No Information Rate : 0.8429              
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.8869              
#>                                               
#>  Mcnemar's Test P-Value : 0.00225             
#>                                               
#>             Sensitivity : 0.9712              
#>             Specificity : 0.9677              
#>          Pos Pred Value : 0.8487              
#>          Neg Pred Value : 0.9945              
#>              Prevalence : 0.1571              
#>          Detection Rate : 0.1526              
#>    Detection Prevalence : 0.1798              
#>       Balanced Accuracy : 0.9694              
#>                                               
#>        'Positive' Class : 1                   
#> 
# 96.8% accuracy

# 97% Recall
# 84% Precision
# 90% F1

Decision tree

The decision tree model provides 81% accuracy, 43% precision and 62% recall.

Some of the most important variables are: MntMeatProducts, MntGoldProds, Income, MntWines, NumCatalogPurchases.

tree_over <- rpart(Response ~ ., data = train_balanced ,method = 'class')
rpart.plot(tree_over, extra = 106)

pred_tree_over <- predict(tree_over, newdata = test, type="class")

#Confusion matrix

cm_dt<-confusionMatrix(pred_tree_over, as.factor(test$Response), positive = "1")

print(cm_dt)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 473  39
#>          1  85  65
#>                                           
#>                Accuracy : 0.8127          
#>                  95% CI : (0.7808, 0.8417)
#>     No Information Rate : 0.8429          
#>     P-Value [Acc > NIR] : 0.9841          
#>                                           
#>                   Kappa : 0.4006          
#>                                           
#>  Mcnemar's Test P-Value : 0.0000532       
#>                                           
#>             Sensitivity : 0.62500         
#>             Specificity : 0.84767         
#>          Pos Pred Value : 0.43333         
#>          Neg Pred Value : 0.92383         
#>              Prevalence : 0.15710         
#>          Detection Rate : 0.09819         
#>    Detection Prevalence : 0.22659         
#>       Balanced Accuracy : 0.73634         
#>                                           
#>        'Positive' Class : 1               
#> 
#variable importance
vtree <- varImp(tree_over)

VI_dt<- data.frame(var=names(train_balanced[,-9]), imp=vtree)

VI_plot_dt <- VI_dt[order(VI_dt$Overall, decreasing=FALSE),]
par(mar = c(2, 10, 4, 2) + 0.1)
barplot(VI_plot_dt$Overall,
        names.arg=rownames(VI_plot_dt),
        horiz=TRUE,
        las = 1,
        col='steelblue',
        xlab='Variable Importance',
        main="Variable importance of the Decision Tree",
        cex.names = 0.65)

Random forest

The random forest model provides 88% accuracy, 76% precision and 40% recall.

Some of the most important variables are: Recency, MntGoldProds, MntMeatProducts, MntWines, NumStorePurchases.

rf_default <- train(as.factor(Response)~.,data = train_balanced, method = "rf", metric = "Accuracy", importance=TRUE)


#predict
p1 <- predict(rf_default, test)

#accuracy

cm_rf <-confusionMatrix(p1, as.factor(test$Response), positive ="1")
print(cm_rf)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 545  62
#>          1  13  42
#>                                           
#>                Accuracy : 0.8867          
#>                  95% CI : (0.8601, 0.9098)
#>     No Information Rate : 0.8429          
#>     P-Value [Acc > NIR] : 0.0007874       
#>                                           
#>                   Kappa : 0.4708          
#>                                           
#>  Mcnemar's Test P-Value : 0.00000002981   
#>                                           
#>             Sensitivity : 0.40385         
#>             Specificity : 0.97670         
#>          Pos Pred Value : 0.76364         
#>          Neg Pred Value : 0.89786         
#>              Prevalence : 0.15710         
#>          Detection Rate : 0.06344         
#>    Detection Prevalence : 0.08308         
#>       Balanced Accuracy : 0.69027         
#>                                           
#>        'Positive' Class : 1               
#> 
#variable importance

imp_rf <- varImp(rf_default)
VI_rf<- data.frame(var=names(train_balanced[,-25]), imp=imp_rf$importance)
VI_rf[,2] <- NULL

VI_plot_rf <- VI_rf[order(VI_rf$imp.1, decreasing= FALSE),]
par(mar = c(2, 10, 4, 2) + 0.1)
barplot(VI_plot_rf$imp.1,
        names.arg=rownames(VI_plot_rf),
        horiz=TRUE,
        las = 1,
        col='steelblue',
        xlab='Variable Importance',
        main="Variable importance of the Random Forest",
        cex.names = 0.65)

Linear Discriminant Analysis (LDA)

The linear discriminant analysis model provides 81% accuracy, 44% precision and 76% recall.

Some of the most important variables are: AcceptedCmp5, MntMeatProducts, MntWines, NumCatalogPurchases, AcceptedCmp1, Recency.

lda <- train(as.factor(Response) ~ .,method="lda", data = train_balanced)

pred_lda <- predict(lda, test)

#confusion matrix

cm_lda <- confusionMatrix(as.factor(pred_lda), as.factor(test$Response), positive="1")
print(cm_lda)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 457  24
#>          1 101  80
#>                                           
#>                Accuracy : 0.8112          
#>                  95% CI : (0.7792, 0.8403)
#>     No Information Rate : 0.8429          
#>     P-Value [Acc > NIR] : 0.9877          
#>                                           
#>                   Kappa : 0.4521          
#>                                           
#>  Mcnemar's Test P-Value : 0.00000000001063
#>                                           
#>             Sensitivity : 0.7692          
#>             Specificity : 0.8190          
#>          Pos Pred Value : 0.4420          
#>          Neg Pred Value : 0.9501          
#>              Prevalence : 0.1571          
#>          Detection Rate : 0.1208          
#>    Detection Prevalence : 0.2734          
#>       Balanced Accuracy : 0.7941          
#>                                           
#>        'Positive' Class : 1               
#> 
#Variable importance

vida <- varImp(lda)

VI_lda<- data.frame(var=names(train_balanced[,-25]), imp=vida$importance)

VI_plot_lda <- VI_lda[order(VI_lda$Overall, decreasing=FALSE),]
par(mar = c(2, 10, 4, 2) + 0.1)
barplot(VI_plot_lda$Overall,
        names.arg=rownames(VI_plot_lda),
        horiz=TRUE,
        las = 1,
        col='steelblue',
        xlab='Variable Importance',
        main="Variable Importance LDA",
        cex.names = 0.65)

Metrics comparison

The best model is the logistic regression after PCA, with 96.8% accuracy, 97% recall and 84% precision. Since we are classifying an imbalanced data set in which the positive class matters most, precision is a particularly valuable metric here.
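As a quick sanity check on the F1 values in the comparison table, recall that F1 is the harmonic mean of precision and recall. Recomputing it in base R from the precision and recall reported for the logistic regression after PCA:

```r
# F1 is the harmonic mean of precision and recall
f1 <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

# figures reported for the logistic regression after PCA
round(f1(0.8487, 0.9712), 2)
#> [1] 0.91
```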

comparecm<-cbind(cm_nb$byClass,cm_lr1$byClass,cm_pca$byClass,cm_dt$byClass,cm_rf$byClass, cm_lda$byClass)
comparecm2<-cbind(cm_nb$overall["Accuracy"],cm_lr1$overall["Accuracy"],cm_pca$overall["Accuracy"],cm_dt$overall["Accuracy"],cm_rf$overall["Accuracy"], cm_lda$overall["Accuracy"])
comparison <- data.frame(rbind(comparecm2,comparecm))
colnames(comparison) <- c("Naive Bayes", "Logistic Regression","Logistic Regression after PCA", "Decision Tree", "Random Forest", "LDA")
print(comparison)
#>                      Naive Bayes Logistic Regression
#> Accuracy              0.76888218           0.8066465
#> Sensitivity           0.57692308           0.7788462
#> Specificity           0.80465950           0.8118280
#> Pos Pred Value        0.35502959           0.4354839
#> Neg Pred Value        0.91075051           0.9516807
#> Precision             0.35502959           0.4354839
#> Recall                0.57692308           0.7788462
#> F1                    0.43956044           0.5586207
#> Prevalence            0.15709970           0.1570997
#> Detection Rate        0.09063444           0.1223565
#> Detection Prevalence  0.25528701           0.2809668
#> Balanced Accuracy     0.69079129           0.7953371
#>                      Logistic Regression after PCA Decision Tree Random Forest
#> Accuracy                                 0.9682779    0.81268882    0.88670695
#> Sensitivity                              0.9711538    0.62500000    0.40384615
#> Specificity                              0.9677419    0.84767025    0.97670251
#> Pos Pred Value                           0.8487395    0.43333333    0.76363636
#> Neg Pred Value                           0.9944751    0.92382812    0.89785832
#> Precision                                0.8487395    0.43333333    0.76363636
#> Recall                                   0.9711538    0.62500000    0.40384615
#> F1                                       0.9058296    0.51181102    0.52830189
#> Prevalence                               0.1570997    0.15709970    0.15709970
#> Detection Rate                           0.1525680    0.09818731    0.06344411
#> Detection Prevalence                     0.1797583    0.22658610    0.08308157
#> Balanced Accuracy                        0.9694479    0.73633513    0.69027433
#>                            LDA
#> Accuracy             0.8111782
#> Sensitivity          0.7692308
#> Specificity          0.8189964
#> Pos Pred Value       0.4419890
#> Neg Pred Value       0.9501040
#> Precision            0.4419890
#> Recall               0.7692308
#> F1                   0.5614035
#> Prevalence           0.1570997
#> Detection Rate       0.1208459
#> Detection Prevalence 0.2734139
#> Balanced Accuracy    0.7941136

Important variables

Regarding variable importance, the following table contains the most important variables considered by each model to predict the acceptance of a campaign or not.

A variable is counted as important if the model assigned it a weight of at least 50% of the highest weight in that model.

The important variables that repeat the most are “Recency”,“MntMeatProducts”, “MntWines”.

Other important variables are “Income”,“Education”, “Marital_Status”, “Teenhome”, “NumCatalogPurchases”, “NumStorePurchases”, “NumWebVisitsMonth”, “AcceptedCmp3”, “AcceptedCmp5”.
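The 50%-of-maximum rule can be illustrated on a small importance vector (the names and values below are hypothetical, for illustration only):

```r
# hypothetical variable importances
imp <- c(Recency = 100, MntWines = 62, Income = 48, Year_Birth = 10)

# keep variables scoring at least half of the top score
names(imp)[imp >= 0.5 * max(imp)]
#> [1] "Recency"  "MntWines"
```

The loop below applies the same rule column by column to the importance tables of the four models.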

variables <- cbind(vlr, vtree, imp_rf$importance["0"], vida$importance)
names(variables) <- c('Logistic regression', 'Decision tree', 'Random forest', 'LDA')

result_list <- list()

for (col_name in colnames(variables)) {
  col_idx <- which(colnames(variables) == col_name)
  max_value <- max(variables[, col_idx])
  threshold <- 0.5 * max_value
  rows_above_threshold <- rownames(variables)[variables[, col_idx] >= threshold]
  result_list[[col_name]] <- rows_above_threshold
}

important_variables <- unlist(result_list)
variable_counts <- table(important_variables)

sorted_variable_counts <- sort(variable_counts, decreasing = TRUE)

print(sorted_variable_counts)
#> important_variables
#>             Recency     MntMeatProducts            MntWines        AcceptedCmp3 
#>                   4                   3                   3                   2 
#>        AcceptedCmp5           Education              Income      Marital_Status 
#>                   2                   2                   2                   2 
#>           MntFruits NumCatalogPurchases   NumStorePurchases   NumWebVisitsMonth 
#>                   2                   2                   2                   2 
#>            Teenhome        AcceptedCmp1     MntFishProducts        MntGoldProds 
#>                   2                   1                   1                   1 
#>    MntSweetProducts   NumDealsPurchases     NumWebPurchases          Year_Birth 
#>                   1                   1                   1                   1

Summary

Running a PCA before the logistic regression produced a strong result: 96.8% accuracy, 97% recall and 84% precision. However, this model has the disadvantage that we cannot identify which variables matter most for predicting campaign acceptance. Since that knowledge may be crucial for the marketing department, other models (decision tree, random forest and LDA) were run to obtain variable importance and compared for better insights. The most important variables are “Recency”, “MntMeatProducts” and “MntWines”.

Therefore, the logistic regression after PCA should be used to predict campaign acceptance, while the variable importance information can also be used for marketing purposes.

Customer segmentation

Several aggregated variables were created:

  • Variable “age”, derived from “Year_Birth”.
  • Variable “have_kids”, combining “Kidhome” and “Teenhome”.
  • Variable “total_amount_spent”, combining “MntWines”, “MntFruits”, “MntMeatProducts”, “MntSweetProducts”, “MntFishProducts” and “MntGoldProds”.
  • Variable “quantity_of_purchases”, combining “NumCatalogPurchases”, “NumStorePurchases” and “NumWebPurchases”.
#for this task we will disregard campaign acceptance and focus on customer data and consumption habits.

ds4 <- select(ds3, -c(AcceptedCmp3,AcceptedCmp1, AcceptedCmp2, AcceptedCmp4, AcceptedCmp5, Complain, Response))

#we proceed to transform some of the variables for a more comprehensive analysis.

#Age as of 2023

ds4$age <- 2023 - ds4$Year_Birth

# Divide age in categories. The values were chosen arbitrarily.

ds4$category_age <- ifelse(ds4$age<=45,"Young",ifelse(ds4$age>=63,"Old","Adults"))

#Total number of kids per client, plus a dummy variable for whether they have kids
ds4$quantity_of_kids <- ds4$Kidhome+ds4$Teenhome
ds4$have_kids <- ifelse(ds4$quantity_of_kids>0,1,0)

#Total amount spent by each client
ds4$total_amount_spent <- ds4$MntWines+ds4$MntFruits+ds4$MntMeatProducts+ds4$MntFishProducts+ds4$MntSweetProducts+ds4$MntGoldProds

#Number of purchases made by each client
ds4$quantity_of_purchases <- ds4$NumWebPurchases+ds4$NumCatalogPurchases+ds4$NumStorePurchases

#We will drop some variables we used to transform data

ds4 <- select(ds4, -c(Year_Birth, Kidhome, Teenhome))

summary(ds4)
#>    Education     Marital_Status       Income          Recency     
#>  Min.   :0.000   Min.   :0.0000   Min.   :  1.00   Min.   : 0.00  
#>  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 35.00   1st Qu.:24.00  
#>  Median :1.000   Median :1.0000   Median : 51.00   Median :49.00  
#>  Mean   :1.446   Mean   :0.6449   Mean   : 51.12   Mean   :49.01  
#>  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.: 68.00   3rd Qu.:74.00  
#>  Max.   :2.000   Max.   :1.0000   Max.   :113.00   Max.   :99.00  
#>     MntWines        MntFruits     MntMeatProducts  MntFishProducts 
#>  Min.   :   0.0   Min.   :  0.0   Min.   :   0.0   Min.   :  0.00  
#>  1st Qu.:  24.0   1st Qu.:  2.0   1st Qu.:  16.0   1st Qu.:  3.00  
#>  Median : 178.0   Median :  8.0   Median :  68.0   Median : 12.00  
#>  Mean   : 306.2   Mean   : 26.4   Mean   : 165.3   Mean   : 37.76  
#>  3rd Qu.: 507.0   3rd Qu.: 33.0   3rd Qu.: 232.0   3rd Qu.: 50.00  
#>  Max.   :1493.0   Max.   :199.0   Max.   :1725.0   Max.   :259.00  
#>  MntSweetProducts  MntGoldProds    NumDealsPurchases NumWebPurchases 
#>  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000    Min.   : 0.000  
#>  1st Qu.:  1.00   1st Qu.:  9.00   1st Qu.: 1.000    1st Qu.: 2.000  
#>  Median :  8.00   Median : 25.00   Median : 2.000    Median : 4.000  
#>  Mean   : 27.13   Mean   : 44.06   Mean   : 2.318    Mean   : 4.101  
#>  3rd Qu.: 34.00   3rd Qu.: 56.00   3rd Qu.: 3.000    3rd Qu.: 6.000  
#>  Max.   :262.00   Max.   :321.00   Max.   :15.000    Max.   :27.000  
#>  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth      age      
#>  Min.   : 0.000      Min.   : 0.000    Min.   : 0.000    Min.   :27.0  
#>  1st Qu.: 0.000      1st Qu.: 3.000    1st Qu.: 3.000    1st Qu.:46.0  
#>  Median : 2.000      Median : 5.000    Median : 6.000    Median :53.0  
#>  Mean   : 2.645      Mean   : 5.824    Mean   : 5.337    Mean   :54.1  
#>  3rd Qu.: 4.000      3rd Qu.: 8.000    3rd Qu.: 7.000    3rd Qu.:64.0  
#>  Max.   :28.000      Max.   :13.000    Max.   :20.000    Max.   :83.0  
#>  category_age       quantity_of_kids   have_kids      total_amount_spent
#>  Length:2205        Min.   :0.0000   Min.   :0.0000   Min.   :   5.0    
#>  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:  69.0    
#>  Mode  :character   Median :1.0000   Median :1.0000   Median : 397.0    
#>                     Mean   :0.9488   Mean   :0.7152   Mean   : 606.8    
#>                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1047.0    
#>                     Max.   :3.0000   Max.   :1.0000   Max.   :2525.0    
#>  quantity_of_purchases
#>  Min.   : 0.00        
#>  1st Qu.: 6.00        
#>  Median :12.00        
#>  Mean   :12.57        
#>  3rd Qu.:18.00        
#>  Max.   :32.00

Analysis of new variables

Total amount spent vs demographics

  • Individuals without kids spend more.
  • Individuals that are graduated or have masters+ spend more than individuals with basic education.
  • Individuals that are older spend more than the younger ones.
  • Individuals with higher income spend more.
ggplot(ds4, aes(y= total_amount_spent, x= as.factor(have_kids), group = have_kids, fill = as.factor(have_kids)))+geom_boxplot()+labs(title ="Boxplot have kids") #no kids, more spent

ggplot(ds4, aes(y= total_amount_spent, x= as.factor(Marital_Status), group = Marital_Status, fill = as.factor(Marital_Status)))+geom_boxplot()+labs(title ="Boxplot marital status") #no effect

ggplot(ds4, aes(y= total_amount_spent, x= as.factor(Education), group = Education, fill = as.factor(Education)))+geom_boxplot()+labs(title ="Boxplot Education")#Graduated and phd spent more than non graduate ones

ggplot(ds4, aes(y= total_amount_spent, x= category_age, group = category_age, fill = category_age))+geom_boxplot()+labs(title ="Boxplot age category")#The older the more they spent

ggplot(ds4, aes(y= total_amount_spent, x= Income, group = Income, fill = Income))+geom_boxplot()+labs(title ="Boxplot income")#The higher the income the more they spent

Quantity of purchases vs demographics

  • Individuals without kids purchased more times.
  • Individuals that are graduated or have masters+ purchased more times than individuals with basic education.
  • Individuals that are older purchased more times than the younger ones.
  • Individuals with higher income purchased more times.
ggplot(ds4, aes(y= quantity_of_purchases, x= as.factor(have_kids), group = have_kids, fill = as.factor(have_kids)))+geom_boxplot()+labs(title ="Boxplot have kids") #no kids, purchased more times.

ggplot(ds4, aes(y= quantity_of_purchases, x= as.factor(Marital_Status), group = Marital_Status, fill = as.factor(Marital_Status)))+geom_boxplot()+labs(title ="Boxplot Marital Status") #no effect

ggplot(ds4, aes(y= quantity_of_purchases, x= as.factor(Education), group = Education, fill = as.factor(Education)))+geom_boxplot()+labs(title ="Boxplot Education")#Graduated and phd spent more than non graduate ones

ggplot(ds4, aes(y= quantity_of_purchases, x= category_age, group = category_age, fill = category_age))+geom_boxplot()+labs(title ="Boxplot Age category")#The older the more times they purchase

ggplot(ds4, aes(y= quantity_of_purchases, x= Income, group = Income, fill = Income))+geom_boxplot()+labs(title ="Boxplot Income")#The higher the income the more times they purchase

After the analysis it was decided to cluster based on total amount spent and quantity of purchases and use demographic information of clients to build the profiling of the clusters.
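One detail worth noting before clustering: the two chosen variables have very different ranges (total amount spent goes up to ~2,500 while quantity of purchases tops out at 32), so raw Euclidean distances in k-means are dominated by spending. Standardizing the features first is a common alternative; a minimal sketch with made-up values shaped like ds5:

```r
# hypothetical miniature of ds5, for illustration only
df <- data.frame(total_amount_spent    = c(5, 400, 1000, 2500),
                 quantity_of_purchases = c(2, 10, 18, 30))

# scale() gives each column zero mean and unit variance,
# so both features contribute comparably to k-means distances
scaled <- scale(df)
colMeans(scaled)  # both columns centered at 0
```

Whether to scale is a judgment call; here the unscaled clustering effectively segments customers by spending, which matches the profiling goal.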

Clustering K-means

ds5 <- select(ds4, c(total_amount_spent, quantity_of_purchases))
ds4$category_age <- as.integer(as.factor(ds4$category_age)) # encode categories as integers

Elbow method

The elbow method suggests 2 to 3 clusters.

fviz_nbclust(ds5, kmeans, method = "wss") 

Silhouette

The silhouette method also suggests 2 to 3 clusters.

fviz_nbclust(ds5, kmeans, method = "silhouette")

### Clustering

# note: kmeans starts are random; calling set.seed() beforehand makes results reproducible
km <- kmeans(ds5, centers = 3)
#visualize clusters
fviz_cluster(km, data = ds5, xlab = "Total Amount Spent", ylab = "Quantity of Purchases") +
  theme(plot.title = element_text(hjust = 0.5, size = 16))

Mean values of each variable for each cluster

#add column to dataset
ds4$kmean <- km$cluster
#cluster characterization
aggregate(ds4,list(ds4$kmean),mean)
#>   Group.1 Education Marital_Status   Income  Recency  MntWines MntFruits
#> 1       1  1.464789      0.6225352 76.80000 50.79437 833.44789 66.456338
#> 2       2  1.477537      0.6405990 65.22296 48.80865 473.58902 43.425957
#> 3       3  1.425941      0.6533227 37.04243 48.59808  75.73419  6.827862
#>   MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds
#> 1       522.81408        95.35493        70.092958     81.67887
#> 2       228.91514        61.27121        43.717138     70.54908
#> 3        33.09528        10.07046         6.934347     20.61649
#>   NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
#> 1          1.515493        5.552113           6.1690141          8.335211
#> 2          2.728785        6.093178           4.3777038          8.660566
#> 3          2.349079        2.729384           0.8102482          3.744596
#>   NumWebVisitsMonth      age category_age quantity_of_kids have_kids
#> 1          3.442254 54.80000     1.833803        0.2873239 0.2507042
#> 2          4.379368 56.68220     1.715474        0.7504160 0.6439268
#> 3          6.336269 52.65092     1.780624        1.2321857 0.8815052
#>   total_amount_spent quantity_of_purchases kmean
#> 1          1669.8451             20.056338     1
#> 2           921.4676             19.131448     2
#> 3           153.2786              7.284227     3

Barplot: individuals per cluster

#barplot
ggplot(ds4, aes( x=factor(kmean)))+ geom_bar() + labs(title = "Distribution of Clusters") + xlab("k-mean") +
scale_x_discrete(labels=c("3" = "Low spenders", "1" = "Mid spenders", "2" = "High spenders"))+ theme(plot.title = element_text(hjust = 0.5, size = 16))

Cluster characterization

K-means was run with 3 clusters, which are characterized below.

  1. Cluster 3: “Low spenders” - This group has the lowest income (~37k) and tends to have more than one kid. They spend the least in all categories, chase deal purchases and visit the website most often. They made few purchases, normally through the store or the web and rarely through the catalog.

  2. Cluster 1: “Mid spenders” - This group has medium income (~65k) and tends to have one kid. They spend a good amount in all categories, chase deal purchases and visit the website often. They made many purchases through all channels, with a preference for store purchases.

  3. Cluster 2: “High spenders” - This group has the highest income (~77k) and normally has no kids. They spend the most in all categories, do not chase deal purchases and do not visit the website often. They made many purchases through all channels, with a preference for store purchases.

Linear regression

For the linear regression, the same clusters were kept and each cluster's expenditure was compared to the total amount spent: the “Low Spenders” accounted for 15% of overall expenditure, the “Mid Spenders” for 41% and the “High Spenders” for 44%.

clust_mid <- filter(ds4, ds4$kmean == 1)
clust_high <- filter(ds4, ds4$kmean == 2)
clust_low <- filter(ds4, ds4$kmean == 3)

a <- sum(clust_low$total_amount_spent)  # 190393
b <- sum(clust_mid$total_amount_spent)  # 553802
c <- sum(clust_high$total_amount_spent) # 591065
d <- a + b + c

#percentages of total amount spent by cluster
#15% cluster_low
#41% cluster_mid
#44% cluster_high

Cluster Low Spenders

For the “Low Spenders”, education, kids and deal purchases reduce the total amount spent, while income, purchases through all channels and monthly web visits increase it.

model_clust_low <- lm(total_amount_spent~Education+Income+Marital_Status+have_kids+age+
                   Recency+NumDealsPurchases+NumWebPurchases+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth,
                   data=clust_low)

#keep only significant variables

model_clust_low_2 <- lm(total_amount_spent ~ Education + Income + have_kids +
                          NumDealsPurchases + NumWebPurchases + NumCatalogPurchases +
                          NumStorePurchases + NumWebVisitsMonth,
                        data = clust_low)
summary(model_clust_low_2)
#> 
#> Call:
#> lm(formula = total_amount_spent ~ Education + Income + have_kids + 
#>     NumDealsPurchases + NumWebPurchases + NumCatalogPurchases + 
#>     NumStorePurchases + NumWebVisitsMonth, data = clust_low)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -291.770  -19.079   -2.995   16.712  299.366 
#> 
#> Coefficients:
#>                      Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)         -225.1477     6.8179 -33.023 < 0.0000000000000002 ***
#> Education             -5.2097     1.7813  -2.925             0.003511 ** 
#> Income                 1.0056     0.1032   9.748 < 0.0000000000000002 ***
#> have_kids            -12.8495     3.3254  -3.864             0.000117 ***
#> NumDealsPurchases     -1.8030     0.7511  -2.401             0.016515 *  
#> NumWebPurchases       25.0731     0.5859  42.797 < 0.0000000000000002 ***
#> NumCatalogPurchases   51.0176     1.1862  43.010 < 0.0000000000000002 ***
#> NumStorePurchases     44.7064     0.8588  52.057 < 0.0000000000000002 ***
#> NumWebVisitsMonth     13.7287     0.6380  21.519 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 35.07 on 1240 degrees of freedom
#> Multiple R-squared:  0.943,  Adjusted R-squared:  0.9426 
#> F-statistic:  2564 on 8 and 1240 DF,  p-value: < 0.00000000000000022

Cluster Mid Spenders

For the “Mid Spenders”, income, the number of catalog purchases and monthly web visits have positive effects on the amount spent, while having kids has a negative effect; store purchases are not significant in the reduced model.

model_clust_mid <- lm(total_amount_spent~Education+Income+Marital_Status+have_kids+age+
                    Recency+NumDealsPurchases+NumWebPurchases+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth,
                   data=clust_mid)


#keep only significant variables

model_clust_mid_2 <-lm(total_amount_spent~Income+have_kids+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth, 
                    data=clust_mid)
summary(model_clust_mid_2)
#> 
#> Call:
#> lm(formula = total_amount_spent ~ Income + have_kids + NumCatalogPurchases + 
#>     NumStorePurchases + NumWebVisitsMonth, data = clust_mid)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -516.34 -189.03  -41.16  151.03  786.33 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value      Pr(>|t|)    
#> (Intercept)          859.322    154.016   5.579 0.00000004847 ***
#> Income                 9.711      1.601   6.065 0.00000000344 ***
#> have_kids           -122.774     39.492  -3.109       0.00203 ** 
#> NumCatalogPurchases   10.529      5.566   1.892       0.05935 .  
#> NumStorePurchases     -6.023      4.896  -1.230       0.21949    
#> NumWebVisitsMonth     23.453      9.027   2.598       0.00977 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 256.9 on 349 degrees of freedom
#> Multiple R-squared:  0.1356, Adjusted R-squared:  0.1232 
#> F-statistic: 10.95 on 5 and 349 DF,  p-value: 0.0000000008429

Cluster High Spenders

For the “High Spenders”, income significantly increases the amount spent, as does the number of monthly web visits, while having kids decreases it (education is not significant in the reduced model). Since this group visits the website the least and web visits are positively associated with spending, finding a way to increase their web visits could encourage them to spend more.

model_clust_high <- lm(total_amount_spent~Education+Income+Marital_Status+have_kids+age+Recency+NumDealsPurchases+
                     NumWebPurchases+NumCatalogPurchases+NumStorePurchases+NumWebVisitsMonth, data=clust_high)

#keep only significant variables

model_clust_high_2 <-lm(total_amount_spent~Education+Income+have_kids+NumWebVisitsMonth, data=clust_high)
summary(model_clust_high_2)
#> 
#> Call:
#> lm(formula = total_amount_spent ~ Education + Income + have_kids + 
#>     NumWebVisitsMonth, data = clust_high)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -602.81 -142.49   -0.11  151.95  427.10 
#> 
#> Coefficients:
#>                   Estimate Std. Error t value           Pr(>|t|)    
#> (Intercept)        458.156     75.617   6.059 0.0000000024302054 ***
#> Education           14.034     15.572   0.901             0.3678    
#> Income               7.109      0.907   7.838 0.0000000000000212 ***
#> have_kids         -117.380     19.392  -6.053 0.0000000025162304 ***
#> NumWebVisitsMonth   12.442      4.893   2.543             0.0112 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 190.6 on 596 degrees of freedom
#> Multiple R-squared:  0.195,  Adjusted R-squared:  0.1896 
#> F-statistic: 36.08 on 4 and 596 DF,  p-value: < 0.00000000000000022